Stratification bias in low signal microarray studies
BACKGROUND:
When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
RESULTS:
We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice.
CONCLUSION:
Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
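The contrast between pooled and per-fold AUC described above can be illustrated with a small sketch. It is not the authors' experiment: it assumes a hypothetical no-signal scorer (`prior_score`) that assigns every test sample the fraction of positives in its training set, and a toy balanced dataset of 20 positives and 20 negatives. All function and variable names are illustrative. Because removing a positive from the training set lowers the score given to it, pooled scores are negatively correlated with the true labels.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prior_score(train_labels):
    """Hypothetical no-signal scorer: the training-set positive fraction."""
    return sum(train_labels) / len(train_labels)


def cross_validate(labels, folds):
    """Return (pooled AUC, mean per-fold AUC) for a list of test-index folds."""
    pooled_scores, pooled_labels, fold_aucs = [], [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        fold_scores = [prior_score(train)] * len(test_idx)
        fold_labels = [labels[i] for i in test_idx]
        pooled_scores += fold_scores
        pooled_labels += fold_labels
        fold_auc = auc(fold_scores, fold_labels)
        if fold_auc is not None:  # skip folds containing a single class
            fold_aucs.append(fold_auc)
    mean_fold = sum(fold_aucs) / len(fold_aucs) if fold_aucs else None
    return auc(pooled_scores, pooled_labels), mean_fold


# Toy balanced dataset with no features at all (assumption for illustration).
labels = [1] * 20 + [0] * 20

# Unbalanced leave-one-out: every fold is a single sample.
loo_folds = [[i] for i in range(len(labels))]
loo_pooled, _ = cross_validate(labels, loo_folds)

# Unstratified 10-fold: shuffle indices, then deal them into 10 folds of 4.
rng = random.Random(0)
indices = list(range(len(labels)))
rng.shuffle(indices)
kfold = [indices[i::10] for i in range(10)]
kfold_pooled, kfold_mean = cross_validate(labels, kfold)

# Stratified 10-fold: each fold holds exactly 2 positives and 2 negatives.
strat = [[2 * f, 2 * f + 1, 20 + 2 * f, 21 + 2 * f] for f in range(10)]
strat_pooled, strat_mean = cross_validate(labels, strat)

print("LOO pooled AUC:", loo_pooled)
print("Unstratified 10-fold pooled / per-fold:", kfold_pooled, kfold_mean)
print("Stratified 10-fold pooled / per-fold:", strat_pooled, strat_mean)
```

In this toy setting, pooled leave-one-out AUC collapses to 0 even though the scorer carries no information, whereas per-fold averaging and stratified folds both return the chance value of 0.5. Real classifiers show a weaker version of the same effect.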
Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation
We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable to that of character sequence kernels and is better than that of word sequence kernels. The results further suggest that when using a realistic setup that takes into account the case of texts which are not written by any hypothesised authors, the amount of training material has more influence on discrimination performance than the amount of test material. Moreover, we show that the recently proposed author unmasking approach is less useful when dealing with short texts.
Investigating the Encoding of Words in BERT's Neurons using Feature Textualization
Pretrained language models (PLMs) form the basis of most state-of-the-art NLP
technologies. Nevertheless, they are essentially black boxes: Humans do not
have a clear understanding of what knowledge is encoded in different parts of
the models, especially in individual neurons. The situation is different in
computer vision, where feature visualization provides a decompositional
interpretability technique for neurons of vision models. Activation
maximization is used to synthesize inherently interpretable visual
representations of the information encoded in individual neurons. Our work is
inspired by this but presents a cautionary tale on the interpretability of
single neurons, based on the first large-scale attempt to adapt activation
maximization to NLP, and, more specifically, large PLMs. We propose feature
textualization, a technique to produce dense representations of neurons in the
PLM word embedding space. We apply feature textualization to the BERT model
(Devlin et al., 2019) to investigate whether the knowledge encoded in
individual neurons can be interpreted and symbolized. We find that the produced
representations can provide insights about the knowledge encoded in individual
neurons, but that individual neurons do not represent clearcut symbolic units
of language such as words. Additionally, we use feature textualization to
investigate how many neurons are needed to encode words in BERT.
Comment: To be published in 'BlackboxNLP 2023: The 6th Workshop on Analysing
and Interpreting Neural Networks for NLP'. Camera-ready version.
Genetic Diversity, Latency and Co-Infections
Alphaherpesviruses are highly prevalent in equine populations, and co-infections
with more than one strain of these viruses are frequently diagnosed.
Lytic replication and latency with subsequent reactivation, along with new
episodes of disease, can be influenced by genetic diversity generated by
spontaneous mutation and recombination. Latency enhances virus survival by
providing an epidemiological strategy for long-term maintenance of divergent
strains in animal populations. The alphaherpesviruses equine herpesvirus 1
(EHV-1) and 9 (EHV-9) have recently been shown to cross species barriers,
including a recombinant EHV-1 observed in fatal infections of a polar bear and
Asian rhinoceros. Little is known about the latency and genetic diversity of
EHV-1 and EHV-9, especially among zoo and wild equids. Here, we report
evidence of limited genetic diversity in EHV-9 in zebras, whereas there is
substantial genetic variability in EHV-1. We demonstrate that zebras can be
lytically and latently infected with both viruses concurrently. Such a
co-occurrence of infection in zebras suggests that even relatively slow-evolving
viruses such as equine herpesviruses have the potential to diversify rapidly
by recombination. This has potential consequences for the diagnosis of these
viruses and their management in wild and captive equid populations.
Quantitative Disentanglement of the Spin Seebeck, Proximity-Induced, and Ferromagnetic-Induced Anomalous Nernst Effect in Normal-Metal-Ferromagnet Bilayers
We identify and investigate thermal spin transport phenomena in
sputter-deposited Pt/NiFe$_2$O$_x$ ($4 \geq x \geq 0$) bilayers. We
separate the voltage generated by the spin Seebeck effect from the anomalous
Nernst effect contributions and even disentangle the intrinsic anomalous Nernst
effect (ANE) in the ferromagnet (FM) from the ANE produced by the Pt that is
spin polarized due to its proximity to the FM. Further, we probe the dependence
of these effects on the electrical conductivity and the band gap energy of the
FM film varying from nearly insulating NiFe$_2$O$_4$ to metallic
Ni$_{33}$Fe$_{67}$. A proximity-induced ANE could only be identified in the
metallic Pt/Ni$_{33}$Fe$_{67}$ bilayer in contrast to Pt/NiFe$_2$O$_x$
($4 \geq x > 0$) samples. This is verified by the investigation of static magnetic
proximity effects via x-ray resonant magnetic reflectivity.
Ferritin H deficiency deteriorates cellular iron handling and worsens Salmonella typhimurium infection by triggering hyperinflammation
Iron is an essential nutrient for mammals as well as for pathogens. Inflammation-driven changes in systemic and cellular iron homeostasis are central for host-mediated antimicrobial strategies. Here, we studied the role of the iron storage protein ferritin H (FTH) for the control of infections with the intracellular pathogen Salmonella enterica serovar Typhimurium by macrophages. Mice lacking FTH in the myeloid lineage (LysM-Cre+/+Fthfl/fl mice) displayed impaired iron storage capacities in the tissue leukocyte compartment, increased levels of labile iron in macrophages, and an accelerated macrophage-mediated iron turnover. While under steady-state conditions, LysM-Cre+/+Fth+/+ and LysM-Cre+/+Fthfl/fl animals showed comparable susceptibility to Salmonella infection, i.v. iron supplementation drastically shortened survival of LysM-Cre+/+Fthfl/fl mice. Mechanistically, these animals displayed increased bacterial burden, which contributed to uncontrolled triggering of NF-κB and inflammasome signaling and development of cytokine storm and death. Importantly, pharmacologic inhibition of the inflammasome and IL-1β pathways reduced cytokine levels and mortality and partly restored infection control in iron-treated ferritin-deficient mice. These findings uncover incompletely characterized roles of ferritin and cellular iron turnover in myeloid cells in controlling bacterial spread and for modulating NF-κB and inflammasome-mediated cytokine activation, which may be of vital importance in iron-overloaded individuals suffering from severe infections and sepsis.
Performance of the first prototype of the CALICE scintillator strip electromagnetic calorimeter
A first prototype of a scintillator strip-based electromagnetic calorimeter
was built, consisting of 26 layers of tungsten absorber plates interleaved with
planes of 45 × 10 × 3 mm³ plastic scintillator strips. Data were collected using a
positron test beam at DESY with momenta between 1 and 6 GeV/c. The prototype's
performance is presented in terms of the linearity and resolution of the energy
measurement. These results represent an important milestone in the development
of highly granular calorimeters using scintillator strip technology. This
technology is being developed for a future linear collider experiment, aiming
at the precise measurement of jet energies using particle flow techniques.
Resource-aware Research on Universe and Matter: Call-to-Action in Digital Transformation
Given the urgency to reduce fossil fuel energy production to make climate
tipping points less likely, we call for resource-aware knowledge gain in the
research areas on Universe and Matter with emphasis on the digital
transformation. A portfolio of measures is described in detail and then
summarized according to the timescales required for their implementation. The
measures will both contribute to sustainable research and accelerate scientific
progress through increased awareness of resource usage. This work is based on a
three-day workshop on sustainability in digital transformation held in May
2023. Comment: 20 pages, 2 figures, publication following workshop 'Sustainability
in the Digital Transformation of Basic Research on Universe & Matter', 30 May
to 2 June 2023, Meinerzhagen, Germany, https://indico.desy.de/event/3748